Prosodic modeling for improved speech recognition and understanding
Abstract
The general goal of this thesis is to model the prosodic aspects of speech to improve human-computer dialogue systems. Towards this goal, we investigate a variety of ways of utilizing prosodic information to enhance speech recognition and understanding performance, and address some issues and difficulties in modeling speech prosody during this process.
We explore prosodic modeling in two languages, Mandarin Chinese and English, which have very different prosodic characteristics. Chinese is a tonal language, in which intonation is highly constrained by syllable F0 patterns determined by lexical tones. Hence, our strategy is to focus on tone modeling and account for intonational aspects within the context of improving tone models. On the other hand, the acoustic expression of lexical stress in English is obscure and highly influenced by intonation. Thus, we examine the applicability of modeling lexical stress for improved speech recognition, and explore prosodic modeling beyond the lexical level as well.
We first developed a novel continuous pitch detection algorithm (CPDA), which was designed explicitly to promote robustness for telephone speech and prosodic modeling. The algorithm achieved similar performance for studio and telephone speech (4.25% vs. 4.34% in gross error rate). It also has superior performance for both voiced pitch accuracy and Mandarin tone classification accuracy compared with an optimized algorithm in xwaves.
Next, we turned our attention to modeling lexical tones for Mandarin Chinese. We performed empirical studies of Mandarin tone and intonation, focusing on analyzing sources of tonal variations. We demonstrated that tone classification performance can be significantly improved by taking into account F0 declination, phrase boundary, and tone context influences. We explored various ways to incorporate tone model constraints into the summit speech recognition system.
Integration of a simple four-tone model into the first-pass Viterbi search reduced the syllable error rate by 30.2% for a Mandarin digit recognition task, and by 15.9% on the spontaneous utterances in the yinhe domain. However, further improvements by using more refined tone models were not statistically significant.
Leveraging the same mechanisms developed for Mandarin tone modeling, we incorporated lexical stress models into spontaneous speech recognition in the jupiter weather domain, and achieved a 5.5% reduction in word error rate compared to a state-of-the-art baseline performance. However, our recognition results obtained with a one-class (including all vowels) prosodic model seemed to suggest that the gain was mainly due to the elimination of implausible hypotheses, e.g., preventing vowel/non-vowel or vowel/non-phone confusions, rather than to distinguishing the fine differences among different stress and vowel classes.
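As an illustration of what "taking into account F0 declination" can mean in practice, the following is a minimal sketch and not the method used in the thesis. It assumes declination can be approximated by a single straight line fitted to the voiced F0 values of an utterance, and it subtracts that trend so that tone contours late in the utterance sit at the same level as early ones. The function names `fit_line` and `remove_declination` are hypothetical, and unvoiced frames are assumed to have been removed beforehand.

```python
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more points, with equal-length inputs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("time values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def remove_declination(times: Sequence[float], f0: Sequence[float]) -> List[float]:
    """Subtract the fitted linear F0 trend, keeping the level at the first frame.

    Assumption (not from the source): declination is linear over the utterance.
    """
    slope, _ = fit_line(times, f0)
    t0 = times[0]
    return [f - slope * (t - t0) for t, f in zip(times, f0)]


# Hypothetical voiced-frame data: F0 in Hz falling steadily over 1.5 seconds.
times = [0.0, 0.5, 1.0, 1.5]
f0 = [220.0, 210.0, 200.0, 190.0]
flattened = remove_declination(times, f0)
```

In this toy input the contour is purely a falling line, so the compensated values are all equal to the first frame. Real contours would keep their tone-specific shapes after the trend is removed, which is the property a tone classifier would rely on.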
Similar resources
Prosody Modeling for Automatic Speech Recognition and Understanding
This paper summarizes statistical modeling approaches for the use of prosody (the rhythm and melody of speech) in automatic recognition and understanding of speech. We outline effective prosodic feature extraction, model architectures, and techniques to combine prosodic with lexical (word-based) information. We then survey a number of applications of the framework, and give results for automati...
Improved Bayesian Training for Context-Dependent Modeling in Continuous Persian Speech Recognition
Context-dependent modeling is a widely used technique for better phone modeling in continuous speech recognition. While different types of context-dependent models have been used, triphones have been known as the most effective ones. In this paper, a Maximum a Posteriori (MAP) estimation approach has been used to estimate the parameters of the untied triphone model set used in data-driven clust...
A Database for Automatic Persian Speech Emotion Recognition: Collection, Processing and Evaluation
Recent developments in robotics automation have motivated researchers to improve the efficiency of interactive systems by enabling natural man-machine interaction. Since speech is the most popular method of communication, recognizing human emotions from the speech signal has become a challenging research topic known as Speech Emotion Recognition (SER). In this study, we propose a Persian em...
A Frame-Synchronous Prosodic Decoder for Text-Independent Dialog Act Recognition
Dialog act (DA) recognition is an important intermediate task in speech understanding systems. Although past research has demonstrated that prosody can improve the performance of recognizers relying primarily on words, how prosody fares on its own is not well understood. The current work continues an ongoing investigation into settings in which both words and word boundaries are unavailable, wh...
Spontaneous Mandarin Speech Recognition with Disfluencies Detected by Latent Prosodic Modeling (LPM)
In this paper, a new approach for improved spontaneous Mandarin speech recognition using Latent Prosodic Modeling (LPM) for disfluency interruption point (IP) detection is presented. The basic idea is to detect the disfluency interruption points (IPs) prior to the recognition, and then to incorporate this information into the recognition process via second-pass rescoring. For accurate dete...
Publication date: 2001